Correlation
Introduction
In our study of statistics so far, we have focused on describing a single variable at a time using measures of central tendency and dispersion. However, in the real world, variables often do not exist in isolation. We are frequently interested in understanding if and how two or more variables are related to each other. For instance, is there a relationship between the amount of rainfall and the yield of a crop? Does a student's study time affect their exam scores? Is there a connection between a family's income and its expenditure?
Correlation analysis is a statistical tool used to measure and describe the strength and direction of the linear relationship between two quantitative variables. It helps us determine whether the variables move together, in opposite directions, or if there is no relationship at all. It is important to note that correlation measures association, not necessarily causation. This chapter will explore the concept of correlation, its types, and the key techniques used to measure it.
Types Of Relationship
The relationship, or correlation, between two variables can be classified based on its direction and its strength.
Direction of Correlation
- Positive Correlation: Two variables are said to be positively correlated if they tend to move in the same direction. An increase in one variable is associated with an increase in the other, and a decrease in one is associated with a decrease in the other.
Examples:
- The height and weight of individuals.
- Household income and expenditure.
- Amount of rainfall and yield of rice (up to a point).
- Negative Correlation: Two variables are negatively correlated if they tend to move in opposite directions. An increase in one variable is associated with a decrease in the other.
Examples:
- The price of a commodity and its quantity demanded.
- The temperature and the sale of woollen clothes.
- The number of hours spent watching TV and exam scores.
- No Correlation: When there is no discernible relationship between the two variables, they are said to have no correlation or zero correlation. A change in one variable is not associated with any particular change in the other.
Examples:
- A person's shoe size and their intelligence level.
- The price of rice and the demand for cars.
Strength of Correlation
The strength of the correlation refers to how closely the two variables are related. It is typically described as:
- Perfect Correlation: When the relationship between the two variables is perfectly linear. All points on a scatter diagram would fall on a straight line. The correlation coefficient would be exactly +1 (perfect positive) or -1 (perfect negative).
- High Degree of Correlation: A strong relationship where the points on a scatter diagram are very close to a straight line.
- Moderate Degree of Correlation: A definite but not very strong relationship.
- Low Degree of Correlation: A weak relationship that is barely discernible.
- Zero Correlation: The absence of any linear relationship.
Techniques For Measuring Correlation
There are several techniques to measure and visualise correlation. The main ones are the Scatter Diagram, Karl Pearson’s Coefficient of Correlation, and Spearman’s Rank Correlation.
Scatter Diagram
A scatter diagram is a simple graphical tool used to visualise the relationship between two variables. It is a graph where the values of one variable (X) are plotted along the horizontal axis, and the values of the other variable (Y) are plotted along the vertical axis. Each pair of (X, Y) values is represented by a single point on the graph. The resulting pattern of these points can give a good idea of the presence, direction, and strength of the correlation.
Karl Pearson’s Coefficient Of Correlation
This is the most widely used mathematical method for measuring the intensity or magnitude of a linear relationship between two quantitative variables. It is also known as the product-moment correlation coefficient and is denoted by 'r'.
Derivation and Formula:
The coefficient 'r' is defined as the ratio of the covariance between the two variables (X and Y) to the product of their standard deviations.
$ \text{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N} $
$ \sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} \quad \text{and} \quad \sigma_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{N}} $
$ r = \frac{\text{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N}}{\sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} \sqrt{\frac{\sum (y_i - \bar{y})^2}{N}}} $
Simplifying this gives the main formula:
$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $
For computational purposes, a more direct formula, written in terms of the raw values $x$ and $y$ rather than their deviations, is used:
$ r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{N\sum x^2 - (\sum x)^2} \sqrt{N\sum y^2 - (\sum y)^2}} $
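As an illustration, the computational formula translates directly into a few lines of Python. This is a sketch with made-up study-time data; the function name `pearson_r` is our own:

```python
import math

def pearson_r(x, y):
    """Karl Pearson's r via the computational formula:
    r = (N*Sxy - Sx*Sy) / sqrt((N*Sxx - Sx^2) * (N*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# Hypothetical data: hours studied (X) and exam score (Y)
hours = [2, 4, 6, 8, 10]
score = [35, 50, 60, 72, 85]
print(round(pearson_r(hours, score), 4))  # → 0.9984
```

A value this close to +1 indicates a very high degree of positive correlation in the illustrative data.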
Properties Of Correlation Coefficient
- The value of the correlation coefficient 'r' always lies between -1 and +1, i.e., $ -1 \le r \le +1 $.
  - If $r = +1$, there is a perfect positive linear correlation.
  - If $r = -1$, there is a perfect negative linear correlation.
  - If $r = 0$, there is no linear correlation.
- 'r' is a pure number and has no units.
- 'r' is independent of the change of origin and scale. This means that if we subtract a constant from X and Y, or divide them by a constant, the value of 'r' will not change. This property is used in the step-deviation method.
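The last property, independence of origin and scale, is easy to verify numerically. The sketch below uses illustrative data and an inline Pearson computation of our own:

```python
import math

def pearson_r(x, y):
    # Computational form of Karl Pearson's coefficient
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(v * v for v in x), sum(v * v for v in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

x = [12, 15, 18, 21, 27]
y = [5, 9, 8, 14, 16]
# Change of origin (subtract 10) and scale (divide by 3) on X;
# change of origin (subtract 5) on Y. The value of r is unchanged.
u = [(v - 10) / 3 for v in x]
w = [v - 5 for v in y]
print(abs(pearson_r(x, y) - pearson_r(u, w)) < 1e-9)  # → True
```

This invariance is exactly what makes the step-deviation method valid: shifting and rescaling the data simplifies the arithmetic without altering r.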
Example 3. Calculate Karl Pearson's coefficient of correlation between the price and quantity supplied for the following data.
| Price (₹) (X) | 4 | 6 | 8 | 10 | 12 |
|---|---|---|---|---|---|
| Supply (kg) (Y) | 20 | 30 | 40 | 50 | 60 |
Answer:
| X | Y | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|
| 4 | 20 | 16 | 400 | 80 |
| 6 | 30 | 36 | 900 | 180 |
| 8 | 40 | 64 | 1600 | 320 |
| 10 | 50 | 100 | 2500 | 500 |
| 12 | 60 | 144 | 3600 | 720 |
| $\sum x=40$ | $\sum y=200$ | $\sum x^2=360$ | $\sum y^2=9000$ | $\sum xy=1800$ |
$N=5$. Using the formula:
$ r = \frac{5(1800) - (40)(200)}{\sqrt{5(360) - (40)^2} \sqrt{5(9000) - (200)^2}} $
$ r = \frac{9000 - 8000}{\sqrt{1800 - 1600} \sqrt{45000 - 40000}} = \frac{1000}{\sqrt{200} \sqrt{5000}} $
$ r = \frac{1000}{\sqrt{1000000}} = \frac{1000}{1000} = 1 $
Since $r = +1$, there is a perfect positive correlation between price and supply.
Spearman’s Rank Correlation
Developed by Charles Spearman, this method measures the correlation between the ranks assigned to the observations of two variables, rather than their actual values. It is a non-parametric measure and is particularly useful in two situations:
- When the data is qualitative and cannot be measured numerically but can be ranked (e.g., beauty, intelligence, leadership).
- When the data is quantitative but contains extreme values (outliers), as ranking reduces the impact of such outliers.
The formula for Spearman's Rank Correlation Coefficient (R) is:
$ R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} $
where $D$ is the difference between the ranks of the two variables ($R_x - R_y$), and $N$ is the number of pairs of observations.
Case 1: When The Ranks Are Given
If the ranks are already provided, we simply calculate the differences (D), square them ($\sum D^2$), and apply the formula.
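A minimal sketch of Case 1 in Python; the data (two judges' ranks for five contestants) is hypothetical:

```python
def spearman_from_ranks(rx, ry):
    """Spearman's R from pre-assigned ranks: R = 1 - 6*sum(D^2) / (N(N^2 - 1))."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d2) / (n * (n**2 - 1))

# Hypothetical ranks given by two judges to five contestants
rx = [1, 2, 3, 4, 5]
ry = [2, 1, 4, 3, 5]
print(spearman_from_ranks(rx, ry))  # → 0.8
```

Here $\sum D^2 = 4$, so $R = 1 - 24/120 = 0.8$, a high degree of agreement between the two judges.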
Case 2: When The Ranks Are Not Given
If we have raw quantitative data, we must first assign ranks to each variable separately. We can rank from highest to lowest or lowest to highest, but the same method must be used for both variables.
Example 4. Calculate the rank correlation between marks in Maths (X) and Physics (Y).
| Maths (X) | 85 | 60 | 72 | 50 | 95 |
|---|---|---|---|---|---|
| Physics (Y) | 90 | 75 | 80 | 65 | 92 |
Answer:
| X | Y | Rank X ($R_x$) | Rank Y ($R_y$) | $D = R_x - R_y$ | $D^2$ |
|---|---|---|---|---|---|
| 85 | 90 | 2 | 2 | 0 | 0 |
| 60 | 75 | 4 | 4 | 0 | 0 |
| 72 | 80 | 3 | 3 | 0 | 0 |
| 50 | 65 | 5 | 5 | 0 | 0 |
| 95 | 92 | 1 | 1 | 0 | 0 |
| | | | | Total | $\sum D^2 = 0$ |
$N=5$.
$ R = 1 - \frac{6 \times 0}{5(5^2 - 1)} = 1 - 0 = 1 $.
There is a perfect positive rank correlation.
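The whole Case 2 procedure, ranking the raw values and then applying the formula, can be sketched as follows. The helper `ranks_desc` is our own and assumes no tied values; the data is from Example 4:

```python
def ranks_desc(values):
    """Assign rank 1 to the largest value (no ties assumed in this sketch)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    # Rank both variables the same way, then apply R = 1 - 6*sum(D^2)/(N(N^2-1))
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_desc(x), ranks_desc(y)))
    return 1 - (6 * d2) / (n * (n**2 - 1))

# Example 4 data: marks in Maths (X) and Physics (Y)
maths = [85, 60, 72, 50, 95]
physics = [90, 75, 80, 65, 92]
print(spearman(maths, physics))  # → 1.0
```

The ranks come out as [2, 4, 3, 5, 1] for both subjects, so every $D$ is zero and $R = 1$, matching the worked answer above.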
Case 3: When The Ranks Are Repeated (Tied Ranks)
If two or more observations have the same value, they are given the same rank, which is the average of the ranks they would have occupied. For every tie, a correction factor (C.F.) must be calculated and added to $\sum D^2$.
Correction Factor: $ C.F. = \frac{m(m^2 - 1)}{12} $, where 'm' is the number of times an item is repeated.
Modified Formula:
$ R = 1 - \frac{6 \left( \sum D^2 + \sum C.F. \right)}{N(N^2 - 1)} $
Example 5. In a dataset, the value 80 appears 3 times in variable X and the value 65 appears 2 times in variable Y. Calculate the correction factors.
Answer:
For variable X, value 80 is repeated 3 times ($m=3$).
$ C.F._x = \frac{3(3^2 - 1)}{12} = \frac{3(8)}{12} = 2 $.
For variable Y, value 65 is repeated 2 times ($m=2$).
$ C.F._y = \frac{2(2^2 - 1)}{12} = \frac{2(3)}{12} = 0.5 $.
The total correction factor to be added to $\sum D^2$ is $ \sum C.F. = 2 + 0.5 = 2.5 $.
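The correction-factor arithmetic above can be checked with a short helper (a sketch; `Counter` from the standard library tallies how many times each value repeats):

```python
from collections import Counter

def correction_factor(values):
    """Total C.F. = sum of m(m^2 - 1)/12, one term per value repeated m (> 1) times."""
    return sum(m * (m**2 - 1) / 12 for m in Counter(values).values() if m > 1)

# Illustrative data matching Example 5: 80 appears 3 times in X, 65 twice in Y
x = [80, 80, 80, 70, 60]
y = [65, 65, 75, 85, 95]
print(correction_factor(x) + correction_factor(y))  # → 2.5
```

This total of 2.5 is the quantity added to $\sum D^2$ in the modified formula.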
Conclusion
Correlation analysis is a powerful statistical technique that provides a numerical measure of the degree of association between two variables. It helps us understand complex real-world phenomena where variables influence each other.
Techniques like scatter diagrams offer a quick visual insight, while Karl Pearson's coefficient provides a precise measure for linear relationships in quantitative data. Spearman's rank correlation offers a robust alternative for qualitative data or data with outliers. The choice of method depends on the nature of the data and the research question.
It is crucial to remember the most important caveat of this analysis: correlation does not imply causation. Just because two variables are highly correlated does not mean that one causes the other. There could be a third, unobserved variable influencing both. Despite this limitation, correlation is an indispensable tool for exploratory data analysis and serves as a fundamental building block for more advanced techniques like regression.